Hierarchical Cross-Modal Agent for
Robotics Vision-and-Language Navigation

Muhammad Zubair Irshad
Chih-Yao Ma
Zsolt Kira
Georgia Tech
FAIR/Georgia Tech
Georgia Tech
Accepted at IEEE International Conference on Robotics and Automation (ICRA), 2021

[Paper]
[Dataset]
[Poster]
[Github Code]



Deep learning has revolutionized our ability to solve complex problems such as Vision-and-Language Navigation (VLN). This task requires an agent to navigate to a goal purely from visual sensory inputs, given natural language instructions. However, prior works formulate the problem as navigation on a graph with a discrete action space. In this work, we lift the agent off the navigation graph and propose a more complex VLN setting in continuous 3D reconstructed environments. Our proposed setting, Robo-VLN, more closely mimics the challenges of real-world navigation: Robo-VLN tasks have longer trajectory lengths, continuous action spaces, and challenges such as obstacles. We provide a suite of baselines inspired by state-of-the-art works in discrete VLN and show that they are less effective at this task. We further propose that decomposing the task into specialized high- and low-level policies can tackle it more effectively. With extensive experiments, we show that through layered decision making, modularized training, and decoupled reasoning and imitation, our proposed Hierarchical Cross-Modal (HCM) agent outperforms existing baselines on all key metrics and sets a new benchmark for Robo-VLN.
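The high-/low-level decomposition described above can be sketched as follows. This is a minimal illustrative skeleton, not the HCM model: the class names, feature dimensions, the additive fusion standing in for cross-modal attention, and the linear read-outs are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class HighLevelPolicy:
    """Consumes instruction and visual features; emits a sub-goal embedding.
    (Hypothetical stand-in for the paper's high-level reasoning policy.)"""
    def __init__(self, feat_dim=8, subgoal_dim=4):
        self.W = rng.standard_normal((subgoal_dim, feat_dim)) * 0.1

    def act(self, instruction_feat, visual_feat):
        fused = instruction_feat + visual_feat   # stand-in for cross-modal attention
        return np.tanh(self.W @ fused)           # sub-goal embedding

class LowLevelPolicy:
    """Converts the sub-goal into continuous (linear, angular) velocity commands,
    reflecting the continuous action space of Robo-VLN."""
    def __init__(self, subgoal_dim=4):
        self.w_lin = rng.standard_normal(subgoal_dim) * 0.1
        self.w_ang = rng.standard_normal(subgoal_dim) * 0.1

    def act(self, subgoal):
        linear = float(np.clip(self.w_lin @ subgoal, 0.0, 1.0))    # forward velocity
        angular = float(np.clip(self.w_ang @ subgoal, -1.0, 1.0))  # turn rate
        return linear, angular
```

Because the two policies only communicate through the sub-goal embedding, each can be trained in its own module, which is the decoupling of reasoning (high level) from imitation of control (low level) that the abstract refers to.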





This work studies the problem of object goal navigation, which involves navigating to an instance of a given object category in unseen environments. End-to-end learning-based navigation methods struggle at this task, as they are ineffective at exploration and long-term planning. We propose a modular system called 'Goal-Oriented Semantic Exploration', which builds an episodic semantic map and uses it to explore the environment efficiently based on the goal object category. Empirical results in visually realistic simulation environments show that the proposed model outperforms a wide range of baselines, including end-to-end learning-based methods as well as modular map-based methods, and was the winning entry of the CVPR 2020 Habitat ObjectNav Challenge. Ablation analysis indicates that the proposed model learns semantic priors about the relative arrangement of objects in a scene and uses them to explore efficiently. The domain-agnostic design of the modules allows us to transfer the model to a mobile robot platform and achieve similar performance on object goal navigation in the real world.


Goal-Oriented Semantic Exploration

The proposed model consists of two learned modules, Semantic Mapping and a Goal-Oriented Semantic Policy. The Semantic Mapping module builds a semantic map over time, and the Goal-Oriented Semantic Policy selects a long-term goal based on that map to reach the given object goal efficiently. A deterministic local policy based on analytical planners then takes low-level navigation actions to reach the long-term goal.
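The three-stage pipeline above can be sketched in code. This is a hypothetical illustration under strong simplifying assumptions, not the released implementation: the map is a small top-down grid, the "exploration target" fallback is a placeholder, and the local policy is a one-step greedy move standing in for an analytical planner.

```python
import numpy as np

class SemanticMapper:
    """Accumulates per-category evidence into an episodic top-down semantic map.
    In the real system, first-person semantic segmentation is projected to the
    top-down map using depth and agent pose; here we merge given top-down masks."""
    def __init__(self, size=240, num_categories=16):
        self.map = np.zeros((num_categories, size, size), dtype=np.float32)

    def update(self, category_masks):
        self.map = np.maximum(self.map, category_masks)
        return self.map

class GoalOrientedSemanticPolicy:
    """Selects a long-term goal (x, y) on the map for the given object category."""
    def select_goal(self, semantic_map, goal_category):
        channel = semantic_map[goal_category]
        if channel.max() > 0:
            # Goal object already observed: head straight for it.
            y, x = np.unravel_index(channel.argmax(), channel.shape)
            return (x, y)
        # Otherwise choose an exploration target (placeholder: map center;
        # the actual policy is learned and uses semantic priors).
        return (channel.shape[1] // 2, channel.shape[0] // 2)

def local_policy(pos, goal):
    """One greedy grid step toward the long-term goal (stand-in for a
    deterministic analytical planner)."""
    return (pos[0] + np.sign(goal[0] - pos[0]),
            pos[1] + np.sign(goal[1] - pos[1]))
```

The key design point survives even in this toy form: only the Goal-Oriented Semantic Policy needs to be learned, while mapping and local control are handled by deterministic components, which is what makes the system sample-efficient and transferable.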




Short Presentation

A short presentation at the CVPR 2020 Embodied AI Workshop describing our winning entry to the Habitat ObjectNav Challenge.



Demo Video



Real-World Transfer




Source Code and Pre-trained models

We have released the PyTorch implementation of the Goal-Oriented Semantic Exploration system, along with pre-trained models, on GitHub. Try our code!
[GitHub]


Paper and Bibtex

[Paper]

Citation
 
Chaplot, D.S., Gandhi, D., Gupta, A. and Salakhutdinov, R. 2020.
Object Goal Navigation using Goal-Oriented Semantic Exploration.
In Advances in Neural Information Processing Systems (NeurIPS 2020).

[Bibtex]
@inproceedings{chaplot2020object,
  title={Object Goal Navigation using Goal-Oriented Semantic Exploration},
  author={Chaplot, Devendra Singh and Gandhi, Dhiraj and
          Gupta, Abhinav and Salakhutdinov, Ruslan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2020}}
                


Acknowledgements

This work was supported by IARPA DIVA D17PC00340, US Army W911NF1920104, ONR Grant N000141812861, ONR MURI, ONR Young Investigator, DARPA MCS and Apple. We would also like to acknowledge NVIDIA’s GPU support.
Website template from here and here.